ducktape: retry apt/PPA installs in ocsf-server #30386

Open
andrewhsu wants to merge 1 commit into dev from andrewhsu/devprod-4196-retry-ocsf-server-apt

@andrewhsu andrewhsu commented May 6, 2026

Wrap apt and add-apt-repository commands in tests/docker/ducktape-deps/ocsf-server with a retry_apt helper using exponential backoff (3 attempts, 30s/60s sleep) to ride out transient Canonical/Launchpad PPA outages observed in CDT.
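The description above can be sketched as a small shell function. This is a hedged illustration only, not the helper from the diff: the loop structure and the `RETRY_APT_ATTEMPTS` and `RETRY_APT_BASE_DELAY` variables are assumptions introduced here so the delays can be tuned; only the name `retry_apt` and the 3-attempt, 30s/60s schedule come from the PR text.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a retry helper with exponential backoff.
# The actual retry_apt in this PR may be written differently.
# RETRY_APT_ATTEMPTS and RETRY_APT_BASE_DELAY are illustrative knobs (assumptions).
retry_apt() {
  local attempts="${RETRY_APT_ATTEMPTS:-3}"   # total tries, including the first
  local delay="${RETRY_APT_BASE_DELAY:-30}"   # seconds to sleep after the first failure
  local n=1
  while true; do
    # Run the wrapped command; inside `if`, a failure does not trip `set -e`.
    if "$@"; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      echo "retry_apt: '$*' failed after ${attempts} attempts" >&2
      return 1
    fi
    echo "retry_apt: attempt ${n}/${attempts} failed, sleeping ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))   # 30s, then 60s with the defaults
    n=$((n + 1))
  done
}
```

Under these assumptions, a call site would look like `retry_apt apt-get update` or `retry_apt add-apt-repository -y <ppa>`, with the wrapped command passed through unchanged.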

Multiple cdt-aws and cdt-gcp nightly builds across 2026-05-03..05 failed at this exact step with Connection refused or connection timed out to ppa.launchpadcontent.net, or HTTP 504 from Launchpad's REST API during add-apt-repository's signing-key fetch:

| Build | Pipeline / job | Failure surface |
|---|---|---|
| 84028 | redpanda / cdt-aws-internal-amd64 | connection timed out |
| 28906 | vtools / cdt-gcp-nightly-amd64 | Connection refused |
| 28905 | vtools / cdt-gcp-nightly-amd64 | Connection refused |
| 28904 | vtools / cdt-aws-nightly-amd64 | Connection refused |
| 28902 | vtools / cdt-aws-nightly-amd64 | Connection refused |
| 28900 | vtools / cdt-gcp-nightly-amd64 | Connection refused (downstream Elixir 1.12 vs ~> 1.14 mismatch) |
| 28894 | vtools / cdt-aws-nightly-arm64 | connection timed out |
| 28880 | vtools / cdt-aws-nightly-amd64 | HTTP 504 from Launchpad REST API |

Worst-case wall-time added on the failure path: ~1m30s of sleep plus per-attempt apt time. Happy path: no observable change.
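The ~1m30s figure follows from the backoff schedule: with 3 attempts there are only two sleeps, 30s and 60s, and no sleep after the final failure. A minimal arithmetic sketch, assuming the delay simply doubles after each failed attempt (the variable names are illustrative, not from the PR):

```shell
# Sum the sleeps on the all-attempts-fail path (illustrative variables).
attempts=3
delay=30
total=0
n=1
# There is one sleep between consecutive attempts, so attempts - 1 sleeps in total.
while [ "$n" -lt "$attempts" ]; do
  total=$((total + delay))
  delay=$((delay * 2))
  n=$((n + 1))
done
echo "worst-case sleep: ${total}s"   # prints "worst-case sleep: 90s"
```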

This is a temporary mitigation. Eliminating the runtime dependency on Launchpad infrastructure (by mirroring erlang/elixir to a Redpanda-controlled host, or baking them into the base AMI) is tracked separately in DEVPROD-4198.

[DEVPROD-4196]

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

Wrap apt and add-apt-repository commands with a retry_apt helper using
exponential backoff (3 attempts, 30s/60s sleep) to ride out transient
Canonical Launchpad PPA outages observed in CDT.

Multiple cdt-aws and cdt-gcp nightly builds across 2026-05-03..05 failed
at this exact step with `Connection refused` or `connection timed out`
to ppa.launchpadcontent.net, or `HTTP 504` from Launchpad's REST API
during `add-apt-repository` signing-key fetch. Affected: redpanda build
84028 plus vtools builds 28906, 28905, 28904, 28902, 28900, 28894, 28880.

Worst-case wall-time added on the failure path: ~1m30s of sleep plus
per-attempt apt time. Happy path: no observable change.

[DEVPROD-4196]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 17:47
@andrewhsu andrewhsu requested a review from a team as a code owner May 6, 2026 17:47
@andrewhsu andrewhsu requested review from PrzemekZglinicki and removed request for a team May 6, 2026 17:47

Copilot AI left a comment

Pull request overview

This PR improves resilience of the ducktape OCSF server Docker build step by retrying apt/PPA operations that have been failing intermittently due to Canonical/Launchpad outages in CDT (DEVPROD-4196).

Changes:

  • Added a retry_apt helper that retries apt-related commands with exponential backoff (3 attempts, 30s/60s delays).
  • Wrapped apt-get update, apt-get install, and add-apt-repository invocations in tests/docker/ducktape-deps/ocsf-server with retry_apt to reduce flakiness.

@vbotbuildovich

CI test results

test results on build#84102
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FAIL | ShadowLinkTopicFailoverTests | test_link_topic_failover | {"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "tiered_cloud", "with_failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/84102#019dfe67-6164-42df-a111-755ddad58c9b | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkTopicFailoverTests&test_method=test_link_topic_failover |
| FLAKY(PASS) | ShadowLinkingReplicationTests | test_auto_prefix_trimming | {"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "cloud", "with_failures": false} | integration | https://buildkite.com/redpanda/redpanda/builds/84102#019dfe6e-9250-4a40-8fd4-4e85b2938082 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0008, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming |
